Foundations of the Grammar of Graphics

Seminar: Data Visualisation: Principles, Practices and Applications


Julia Schulte-Cloos

University of Marburg


November 20, 2024

✉️ julia.schulte-cloos@uni-marburg.de

Grammar of Graphics (GG)

Why Grammar?

  • “The Grammar of Graphics” (Wilkinson 2012)
    • first edition: 1999
    • theoretical deconstruction of data graphics
  • “A Layered Grammar of Graphics” (Wickham 2010)
  • Good grammar?
    • insights into the composition of complicated graphics
    • reveals unexpected connections between seemingly different graphics
    • first step in creating a good sentence
  • Grammar tells us… (Wickham, Çetinkaya-Rundel, and Grolemund 2023)
    • a statistical graphic is a mapping from data to aesthetic attributes (color, shape, size) of geometric objects (points, lines, bars)
    • → a graphic is a combination of independent components

Compositional Logic

Thomas Lin Pedersen 2020 (celebRation2020)

Compositional Logic

Cedric Scherer 2023 (Posit Conference)

Data

  • your data for the plot
  • in tidy format
  • provides ingredients for your plot
  • tidyverse techniques to prepare data for optimal plotting format
  • one row for every observation that you want to plot
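
As a minimal sketch of what "one row for every observation" means, the example below reshapes a wide table into tidy (long) format with tidyr::pivot_longer(). The data frame rentals_wide and its values are hypothetical and only serve as an illustration; they are not part of the bikes data.

```r
library(tidyr)

# Hypothetical wide table: one row per station, one column per year
rentals_wide <- data.frame(
  station = c("A", "B", "C"),
  `2015`  = c(120, 80, 45),
  `2016`  = c(150, 95, 60),
  check.names = FALSE
)

# Reshape so that every row is one observation (station x year)
rentals_long <- pivot_longer(
  rentals_wide,
  cols      = c(`2015`, `2016`),
  names_to  = "year",
  values_to = "rentals"
)

rentals_long  # 6 rows: one per station-year combination
```

In the long format, year and rentals are ordinary columns and can be mapped directly to aesthetics.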

library(readr)
library(ggplot2)

bikes <-
  read_csv("./slides_data/london-bikes.csv",
    col_types = "Dcfffilllddddc"
  )

→ pass data to ggplot


ggplot(data = bikes)               # initial call + data

Aesthetic Mapping

  • link variables in data to graphical properties in the geometry
  • Aesthetics (aes), to make data visible
    • x, y: variable along the x and y axis
    • colour: color of geoms according to data
    • fill: the inside color of the geom
    • group: what group a geom belongs to
    • shape: the symbol used to plot a point
    • linetype: the type of line used (solid, dashed, etc)
    • size: size scaling for an extra dimension
    • alpha: the transparency of the geom

Christian Burkhart: https://drive.google.com/file/d/1Dvul1p6TYH6gWJzZRwpE0YX1dO0hDF-b/view
# scatter plot of bikes$count versus bikes$temp_feel
ggplot(data = bikes) +              # initial call + data
  aes(x = temp_feel, y = count)     # aesthetics
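
A short sketch of the difference between mapping an aesthetic to a variable (inside aes()) and setting it to a fixed value (outside aes()). To keep the example self-contained it uses the mpg data set that ships with ggplot2 instead of the bikes data; the object names p_mapped and p_fixed are only illustrative.

```r
library(ggplot2)

# Mapping: colour varies with the data (variable `class`) and gets a legend
p_mapped <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(aes(colour = class))

# Setting: colour is a fixed value outside aes(), so no legend is drawn
p_fixed <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point(colour = "steelblue")
```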

Geometries

  • how to interpret aesthetics as graphical representations
  • determines to a large extent the type of plot
  • some examples of geometries available in the ggplot2 framework
    • geom_point(): scatterplot
    • geom_line(): lines connecting points by increasing value of x
    • geom_path(): lines connecting points in sequence of appearance
    • geom_boxplot(): box-and-whisker plot of a continuous variable, usually split by a categorical variable
    • geom_bar(): bar charts for categorical x axis
    • geom_histogram(): histogram for continuous x axis
    • geom_violin(): mirrored kernel density estimate of a variable's distribution
    • geom_smooth(): smoothed trend line fitted to the data
  • each geom_*() is a shortcut for a call to the function layer()
# scatter plot of bikes$count versus bikes$temp_feel
ggplot(data = bikes) +              # initial call + data
  aes(x = temp_feel, y = count) +   # aesthetics
  geom_point()                      # geometric layer
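
To illustrate that a geom is a shortcut for layer(), the sketch below builds the same scatter plot twice: once with geom_point() and once with an explicit layer() call. It uses the built-in mpg data as a stand-in for the bikes data; the stat and position values written out are the defaults of geom_point().

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy))

# Shortcut: geom_point() fills in its default stat and position
p_short <- base + geom_point()

# Explicit: the same layer spelled out with layer()
p_long <- base +
  layer(geom = "point", stat = "identity", position = "identity")
```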

Statistics

  • even though data is tidy it may not represent the displayed values
  • transform input variables to displayed values:
    • e.g., count number of observations in each category for a bar chart
    • e.g., calculate summary statistics for a boxplot
  • implicit in many plot-types but can often be done prior to plotting
  • linked to geometries
  • every geom has a default stat(istic)
  • a layer can be created with a call to stat_*() or geom_*()
  • the latter (geom_*()) is more frequently used in the dataviz community
  • pre-computed data: stat = 'identity'
ggplot(bikes, 
       aes(x = temp_feel, y = count)) +
  geom_point() + 
  # add a smoothed trend line (default method: GAM for 1,000+ observations, LOESS below)
  stat_smooth() # also: geom_smooth()
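
A sketch of the two routes to a bar chart mentioned above: letting the default stat of geom_bar() count the observations, or counting beforehand and passing the pre-computed values with stat = "identity". The example relies on the built-in mpg data rather than the bikes data, and class_counts is a helper table created only for this illustration.

```r
library(ggplot2)

# Route 1: geom_bar() uses stat_count() and counts rows per class itself
p_counted <- ggplot(mpg, aes(x = class)) +
  geom_bar()

# Route 2: count first, then plot the pre-computed values as they are
class_counts <- as.data.frame(table(class = mpg$class))
p_identity <- ggplot(class_counts, aes(x = class, y = Freq)) +
  geom_bar(stat = "identity")  # also: geom_col()

# Both plots end up with the same bar heights
layer_data(p_counted)$y
layer_data(p_identity)$y
```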

Scales

→ translate back and forth between variable ranges and property ranges

position scales

  • control the locations of visual entities in a plot and how those locations are mapped to data values
  • continuous position scales (scale_x_continuous()), with date scales (scale_x_date()) as a special case, vs. discrete position scales (scale_x_discrete())

colour scales

  • continuous color scales (scale_fill_viridis_c(); scale_fill_distiller()), discrete color scales (scale_fill_viridis_d(); scale_fill_brewer())
  • many other helpful R packages that provide color palettes (e.g., paletteer acting as a common interface for these different packages)

other scales

  • size, shape, line type, line width, manual scales, identity scales
ggplot(bikes, aes(x = temp_feel, y = count)) + 
  # color mapping only applied to points
  geom_point(aes(color = day_night)) + 
  # invisible grouping to create two trend lines
  stat_smooth(aes(group = day_night)) + 
  scale_color_viridis_d() + 
  # x axis
  scale_x_continuous(
    # add °C symbol
    labels = function(x) paste0(x, "°C"), 
    # use 5°C spacing
    breaks = -1:6*5  # also: seq(-5, 30, by = 5)
  ) +
  # y axis
  scale_y_continuous(
    # add a thousand separator
    labels = scales::label_comma(), 
    # use consistent spacing across rows
    breaks = 0:5*10000
  )
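
The breaks and labels arguments used above can be checked outside of a plot. The following sketch evaluates them directly; label_celsius is a hypothetical name for the anonymous label function from the slide.

```r
library(scales)

# `:` binds tighter than `*`, so -1:6*5 means (-1:6) * 5
temp_breaks <- -1:6 * 5
all(temp_breaks == seq(-5, 30, by = 5))  # TRUE

# label functions take the break values and return the strings drawn on the axis
label_celsius <- function(x) paste0(x, "°C")
label_celsius(c(0, 15, 30))      # "0°C" "15°C" "30°C"
label_comma()(c(10000, 50000))   # "10,000" "50,000"
```

Testing breaks and labels this way makes it easier to spot mistakes before they end up on an axis.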

Facets

  • define the number of panels with equal logic and split data among them…
  • small multiples
  • should not be used to combine multiple separate plots
  • ggplot2 provides two faceting functions for splitting data by categories: facet_wrap() and facet_grid()
ggplot(bikes, aes(x = temp_feel, y = count)) + 
  geom_point(
    aes(color = season), 
    alpha = .5, size = 1.5
  ) +  
  stat_smooth(
    method = "lm", color = "black"
  ) + 
  scale_color_viridis_d(
    # overwrite legend keys
    labels = c("Winter", "Spring", "Summer", "Autumn")
  ) + 
  # x axis
  scale_x_continuous(
    # add °C symbol
    labels = function(x) paste0(x, "°C"), 
    # use 5°C spacing
    breaks = -1:6*5  # also: seq(-5, 30, by = 5)
  ) +
  # y axis
  scale_y_continuous(
    # add a thousand separator
    labels = scales::label_comma(), 
    # use consistent spacing across rows
    breaks = 0:5*10000
  ) +
  # small multiples
  facet_wrap(facets = vars(day_night)) # also: ~ day_night

plot <- ggplot(bikes, aes(x = temp_feel, y = count)) + 
  geom_point(
    aes(color = season), 
    alpha = .5, size = 1.5
  ) + 
  stat_smooth(
    method = "lm", color = "black"
  ) + 
  scale_color_viridis_d(
    # overwrite legend keys
    labels = c("Winter", "Spring", "Summer", "Autumn")
  ) + 
  # x axis
  scale_x_continuous(
    # add °C symbol
    labels = function(x) paste0(x, "°C"), 
    # use 5°C spacing
    breaks = -1:6*5  # also: seq(-5, 30, by = 5)
  ) + 
  # y axis
  scale_y_continuous(
    # add a thousand separator
    labels = scales::label_comma(), 
    # use consistent spacing across rows
    breaks = 0:5*10000
  ) +
  facet_grid(
    rows = vars(day_night), 
    cols = vars(year), 
    # free y axis range
    scales = "free_y", 
    # scale heights proportionally to length of y scale
    space = "free_y"
  ) +
  labs(
    # overwrite axis and legend titles
    x = "Average feels-like temperature", y = NULL, color = NULL,
    # add plot title and caption
    title = "Trends of Reported Bike Rents versus Feels-Like Temperature in London",
    caption = "Data: TfL (Transport for London), Jan 2015–Dec 2016"
  )
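
Stripped of all styling, the two faceting functions differ mainly in how panels are laid out. A minimal sketch, using the built-in mpg data instead of the bikes data:

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# facet_wrap(): one variable, panels wrapped into rows and columns as needed
p_wrap <- base + facet_wrap(vars(drv))

# facet_grid(): two variables span a row-by-column matrix of panels
p_grid <- base + facet_grid(rows = vars(drv), cols = vars(year))
```

With three drive types and two model years in mpg, the first plot has three panels and the second has six.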

Themes

  • stylistic changes to the plot not related to data
  • can either apply a complete theme or modify individual elements directly
  • theme elements inherit hierarchically (e.g., axis.text.x inherits from axis.text, which inherits from text)
  • wide range of ggplot2 extensions with ready-built themes, e.g.:
    • {ggdark}
    • {ggsci} (also color scales)
    • {ggtech} (also color scales)
    • {ggthemes} (also color scales)
    • {ggthemr}
    • {hrbrthemes} (also color scales)
    • {tvthemes} (also color scales)

Isabella Benabaye: https://isabella-b.com/blog/ggplot2-theme-elements-reference/
plot +
  # add a complete theme with a custom font and an explicit base size
  theme_light(
    base_size = 11, base_family = "Spline Sans"
  )

plot +
  theme_light(base_size = 11, base_family = "Spline Sans") +
  # theme adjustments
  theme(
    plot.title.position = "plot", # left-align title 
    plot.caption.position = "plot", # right-align caption
    legend.position = "top", # place legend above plot
    plot.title = element_text(face = "bold", size = rel(1.2)), # larger, bold title
    axis.text = element_text(family = "Spline Sans Mono"), # monospaced font for axes
    axis.title.x = element_text( # left-aligned, grey x axis label
      hjust = 0, color = "grey20", margin = margin(t = 12)
    ),
    legend.text = element_text(size = rel(1)), # larger legend labels
    strip.text = element_text(face = "bold", size = rel(1.15)), # larger, bold facet labels
    panel.grid.major.x = element_blank(), # no vertical major lines
    panel.grid.minor = element_blank(), # no minor grid lines
    panel.spacing.x = unit(20, "pt"), # increase white space between panels
    panel.spacing.y = unit(10, "pt"), # increase white space between panels
    plot.margin = margin(15, 15, 15, 15) # 15 pt of white space on all four sides
  )
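
How theme elements inherit from each other can be inspected with calc_element(), which resolves an element against a complete theme. The sketch below uses the generic font families "mono" and "serif" as stand-ins for the custom fonts on the slide, so it does not depend on installed fonts; th and th2 are illustrative names.

```r
library(ggplot2)

# base_family is set on the root `text` element of the complete theme
th <- theme_light(base_family = "mono")
calc_element("axis.text.x", th)$family   # inherited from `text`

# Overriding `axis.text` changes its children (axis.text.x, axis.text.y) only
th2 <- th + theme(axis.text = element_text(family = "serif"))
calc_element("axis.text.x", th2)$family  # now taken from `axis.text`
calc_element("plot.title", th2)$family   # still inherited from `text`
```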

Coordinates

  • what kind of canvas should the final data be drawn on?
    • i.e., how should x and y be interpreted
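  • limits and transformations can be applied in a scale or in a coord: scale limits remove out-of-range data before statistics are computed, coord limits only zoom the view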
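
Where limits are applied matters as soon as a statistic is involved. The sketch below, based on the built-in mpg data, computes group means once with limits set in the scale and once with limits set in the coord; the limit values 15 and 30 are arbitrary choices for illustration.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = drv, y = hwy)) +
  stat_summary(fun = mean, geom = "point")

# Scale limits: values outside 15-30 become NA before the mean is computed
p_scale <- base + scale_y_continuous(limits = c(15, 30))

# Coord limits: the mean uses all data; only the visible window changes
p_coord <- base + coord_cartesian(ylim = c(15, 30))

# Compare the computed means per group (expect a warning about removed rows)
layer_data(p_scale)$y
layer_data(p_coord)$y
```

Only the coord version reproduces the means of the full data, which is why coord limits are usually the safer choice for zooming.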

Application 1

Work through the 02_foundations_exercises.qmd together with your neighbor, running the code yourself and trying to understand the syntax behind the Grammar of Graphics.


➡️ For each of the chart types involved (e.g., bar chart, scatter plot) consult the excellent book Fundamentals of Data Visualization by Claus Wilke to develop an intuition for good and bad practices with each type of chart.


➡️ You may also want to consult the book’s source code to check out the code behind the best practice examples.

Time: 60 minutes

Application 2

Use data that is relevant to your field and the subject of your PhD. Try to visualize some of it with ggplot2 and its Grammar of Graphics framework. You may want to consider visualizing distributions, proportions, associations, and uncertainty.


🤓 Try to incorporate the “Principle of Proportional Ink” into your visualizations.


➡️ If you feel unsure about a potential dataset to look at, explore the bike sharing data set introduced earlier and challenge yourself by completing Exercise 2.

Time: 60 minutes

Questions? 🙋 🙋‍♂️

Thanks for your attention.

Wickham, Hadley. 2010. “A Layered Grammar of Graphics.” Journal of Computational and Graphical Statistics 19 (1): 3–28.
Wickham, Hadley, Mine Çetinkaya-Rundel, and Garrett Grolemund. 2023. R for Data Science. O’Reilly Media.
Wilkinson, Leland. 2012. The Grammar of Graphics. Springer.